All images, data, and the R script can be found here.
This is a project I did in the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC.
Prompt
Dr. Sam Parameter, author of the textbook Statistics for Poets, has been contemplating starting a new magazine, Popular Statistics, and needs to know whether it will be profitable enough. You have augmented the original dataset of 50 magazines by measuring more characteristics of the magazines and their readers that may be useful in understanding the one-page advertisement costs. The variables are the following:
Magazine Name
Cost of a four-color, one-page ad
Circulation (projected, in thousands)
Percent male among the predicted readership
Median household income of readership
Market = Sold in USA or outside USA
Your goal is to analyze the data in R using multiple linear regression, choose the best model to explain the differences in advertising costs between the different titles, and then predict what Sam should be able to charge for advertisements in the new magazine.
Steps
Data Preparation
> mag_data <- read.csv("MagazineAds.csv")   # import the data
> sum(is.na(mag_data))                      # check for missing values
> head(mag_data)
                    C1.T pagecost  circ percmale medianincome   Market
1                Audubon    25315  1645     51.1        38787 Domestic
2 Better Homes & Gardens   198000 34797     22.1        41933 Domestic
3          Business Week   103300  4760     68.1        63667 Domestic
4           Cosmopolitan    94100 15452     17.3        44237 Domestic
5                   Elle    55540  3735     12.5        47211 Domestic
6           Entrepreneur    40355  2476     60.4        47579 Domestic
We need to transform the Market variable into a dummy variable.
> mag_data$domestic <- as.numeric(as.numeric(mag_data$Market) == 1)
> # create a new dummy variable (domestic) from the Market variable
> # (this relies on Market being read in as a factor, the pre-R 4.0 default of read.csv)
> mag_data1 <- mag_data[, c(-1, -6)]   # remove the magazine name and Market columns
Data Exploration
Histogram
> for (i in 1:5) {
    histi <- hist(mag_data1[, i], plot = FALSE)
    histi$density <- histi$counts / sum(histi$counts) * 100
    plot(histi, freq = FALSE,
         main = paste("Histogram of", colnames(mag_data1)[i]),
         xlab = colnames(mag_data1)[i])
  }
> # a loop that generates a histogram for each of the five variables
Skewness is noticeable in three variables: medianincome, pagecost, and circ. All three are skewed to the right, with the peak of the histogram to the left and a long right tail. There appear to be more magazines with a one-page ad cost below the average ad cost across all magazines, and the same seems true for median income and circulation: there are more magazines with lower circulation and lower median reader income.
> summary(mag_data1)   # descriptive statistics for the variables
    pagecost           circ          percmale      medianincome      domestic     
 Min.   : 20180   Min.   : 1645   Min.   : 3.60   Min.   :31835   Min.   :0.0000  
 1st Qu.: 48875   1st Qu.: 3634   1st Qu.:15.70   1st Qu.:41413   1st Qu.:0.0000  
 Median : 71968   Median : 5510   Median :47.40   Median :46251   Median :1.0000  
 Mean   : 82386   Mean   :10793   Mean   :42.88   Mean   :48958   Mean   :0.7045  
 3rd Qu.: 99895   3rd Qu.:14528   3rd Qu.:65.55   3rd Qu.:58038   3rd Qu.:1.0000  
 Max.   :198000   Max.   :51925   Max.   :88.50   Max.   :72493   Max.   :1.0000  
Some insights from the descriptive statistics:
The cost of one ad page varies mostly from $50,000 to $100,000
Projected circulation varies mostly from 3,500 to 15,000 (thousands), but over 50% of the observations have a circulation below 6,000
Over 50% of the magazines in the sample target readers with a median household income between $40,000 and $60,000
A significant number of magazines target mainly one gender: 25% of the observations have a male readership below 15%, and another 25% have a male readership above 65%
About 70% of the observations are domestic magazines
Correlation Matrix
> plot(mag_data1)   # scatterplot matrix of all variables
The relationship between pagecost and circ is strong and looks linear. This makes economic sense: the higher the circulation, the stronger the impact of an ad campaign, and therefore the higher the ad cost.
The relationship between percmale and medianincome is also noticeable, although it does not look linear. Unlike magazines that target mostly female readers, magazines that target mostly male readers also tend to reach readers with higher incomes, and probably a higher social class.
There is also a slight relationship between medianincome and pagecost, as well as medianincome and circ. The magazines that have higher ad page cost or higher circulation tend to target readers with lower median income.
The correlation table confirms very similar insights. The cost per ad page is highly correlated with the circulation number, and median income is strongly correlated with the percentage of male readers. Median income also has a slight negative correlation with both ad page cost and circulation.
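The correlation table itself is not shown in the excerpt; it was presumably produced with something like the following (a minimal sketch, assuming the base cor() function was used):

> round(cor(mag_data1), 2)   # pairwise correlations between all numeric variables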
Model 1 - Multiple Linear Regression analysis using all the predictor variables
> M1 = lm(pagecost ~ ., data = mag_data1)
> summary(M1)

Call:
lm(formula = pagecost ~ ., data = mag_data1)

Residuals:
   Min     1Q Median     3Q    Max 
-46196 -14665  -2316   3784  40272 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7179.4408 20738.1460   0.346    0.731    
circ             3.9633     0.3433  11.544 3.77e-14 ***
percmale       -32.4809   151.7827  -0.214    0.832    
medianincome     0.7001     0.4313   1.623    0.113    
domestic      -640.5486  7524.6056  -0.085    0.933    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21620 on 39 degrees of freedom
Multiple R-squared:  0.8013,    Adjusted R-squared:  0.7809
F-statistic: 39.32 on 4 and 39 DF,  p-value: 3.436e-13
This regression model is useful overall: the F-test p-value is 3.436e-13. An R-squared of 80.13% is decent, but there is room for further improvement. At a 5% significance level, however, three of the four explanatory variables are insignificant. The p-value for circ is very small, so circulation contributes significantly to the cost, and I recommend keeping it. The p-values of all other variables are greater than 0.05, so they are not significant at this level and are candidates for removal, although the dummy variable domestic is kept in the models that follow.
The VIF for each explanatory variable is low (< 5), so there does not appear to be multicollinearity.
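The VIF calculation is not shown in the excerpt; one common way to obtain it (an assumption, the original may have used a different package) is:

> library(car)   # for the vif() function
> vif(M1)        # variance inflation factors for the predictors in Model 1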
Predict the pagecost for one observation
Predict what Sam should be able to charge for advertisements in the new magazine if it has a projected circulation of 1,800,000 readers (circ = 1,800, in thousands), 60 percent of whom are male, with a median household income of $80,000, and it will be sold internationally.
> newdata <- data.frame(circ = 1800, percmale = 60, medianincome = 80000, domestic = 0)
> # create the data for Dr. Sam's magazine, Popular Statistics, for prediction
> predict(M1, newdata = newdata, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 68372.49 41879.83 94865.15

$se.fit
[1] 13097.73

$df
[1] 39

$residual.scale
[1] 21620.7
Based on the 95% confidence interval, the mean ad cost for magazines with these characteristics is between $41,879.83 and $94,865.15.
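Note that interval = "confidence" gives an interval for the mean ad cost of magazines with these characteristics. For the cost of Sam's single new magazine, a prediction interval, which is wider, may be more appropriate; it was not computed in the original analysis, but could be obtained with:

> predict(M1, newdata = newdata, interval = "prediction")   # wider interval for an individual magazine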
Residual Analysis
> res_data1 <- residuals(M1)
> plot(res_data1, type = "l")
> abline(h = 0, col = "red")
> # QQ plot to check normality of residuals
> qqnorm(res_data1)
> qqline(res_data1, col = "red")
> shapiro.test(res_data1)

        Shapiro-Wilk normality test

data:  res_data1
W = 0.98262, p-value = 0.7384
The residual plot shows homoscedasticity: the variance of the residuals seems constant. The QQ plot suggests the residuals are approximately normal, and the Shapiro-Wilk test (p = 0.7384) gives no evidence to reject the null hypothesis that the residuals are normally distributed.
> # plot residuals against predicted values
> predicted_data1 <- predict(M1)
> plot(predicted_data1, res_data1)
> abline(h = 0, col = "red")
> library(lmtest)
> dwtest(M1)

        Durbin-Watson test

data:  M1
DW = 2.1763, p-value = 0.7356
alternative hypothesis: true autocorrelation is greater than 0

> acf(res_data1)
The Durbin-Watson test gives a p-value of 0.7356, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows no autocorrelation. However, when we plot the residuals against the predicted values, we see a non-random pattern and probably some outliers, which violates the linearity assumption. The non-random pattern indicates that the deterministic portion of the model (the predictor variables) is not capturing some explanatory information that is "leaking" into the residuals. There are some possible solutions to this problem:
Adding another explanatory variable that helps explain the pattern in the residuals
Transforming a variable in the model to account for the curvature
Adding interactions between explanatory variables
Model 2 - Log transformation of the response variable
Often, when the dependent variable represents financial data (income, price, etc.), using the natural log of the dependent variable helps to alleviate these problems. We re-run the multiple regression analysis using the natural log of the page cost variable.
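The code for this step is not shown in the excerpt; based on the later references to mag_data2, ln.pagecost, and M2, it was presumably along these lines (a sketch, not the original code):

> mag_data2 <- mag_data1
> mag_data2$pagecost <- log(mag_data2$pagecost)   # natural log of the ad page cost
> colnames(mag_data2)[1] <- "ln.pagecost"
> M2 <- lm(ln.pagecost ~ ., data = mag_data2)
> summary(M2)   # reported R-squared 0.6521, adjusted R-squared 0.6164, p-value 1.573e-08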
With a p-value of 1.573e-08, this model is useful. However, it does not seem better than the previous one, because:
The R-squared and adjusted R-squared are worse, at only 0.6521 and 0.6164
The log transformation makes the model harder to interpret
The overall p-value is larger
However, the medianincome variable is relatively useful, with a p-value of 0.08. I recommend keeping three variables: circ, medianincome, and the dummy variable domestic.
Residual Analysis
> res_data2 <- residuals(M2)
> plot(res_data2, type = "l")
> abline(h = 0, col = "red")
> qqnorm(res_data2)
> qqline(res_data2, col = "red")
> predicted_data2 <- predict(M2)
> plot(predicted_data2, res_data2)
> abline(h = 0, col = "red")
> dwtest(M2)

        Durbin-Watson test

data:  M2
DW = 2.2303, p-value = 0.7915
alternative hypothesis: true autocorrelation is greater than 0

> acf(res_data2)
> shapiro.test(res_data2)

        Shapiro-Wilk normality test

data:  res_data2
W = 0.94891, p-value = 0.05011
The residual plot does not show heteroskedasticity, so the assumption of equal error variances is reasonable. The QQ plot suggests the residuals are approximately normal. The Durbin-Watson test gives a p-value of 0.7915, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows none. However, the residuals-versus-fitted plot still shows a non-random pattern and probably some outliers, which violates the linearity assumption.
The scatterplots again suggest a strong correlation between ln.pagecost and circulation: the higher the circulation, the higher the ln.pagecost. It is also noticeable that the relationship between the two is not linear, which suggests a further transformation. All other plots do not show any pattern and look random in nature.
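These scatterplots are presumably from a scatterplot matrix of the log-transformed data; the call is not shown in the excerpt, but would look like:

> plot(mag_data2)   # pairwise scatterplots, now with ln.pagecost as the response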
Model 3 - Transforming the response and explanatory variables
In this model we also take the natural log of the circ variable.
> mag_data3 <- mag_data2
> mag_data3$circ <- log(mag_data3$circ)
> head(mag_data3)
  ln.pagecost      circ percmale medianincome domestic
1    10.13915  7.405496     51.1        38787        1
2    12.19602 10.457286     22.1        41933        1
3    11.54539  8.468003     68.1        63667        1
4    11.45211  9.645494     17.3        44237        1
5    10.92486  8.225503     12.5        47211        1
6    10.60547  7.814400     60.4        47579        1
> colnames(mag_data3)[2] <- "ln.circ"   # rename the log-transformed circulation column
> M3 <- lm(ln.pagecost ~ ., data = mag_data3)
> summary(M3)

Call:
lm(formula = ln.pagecost ~ ., data = mag_data3)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46001 -0.15107  0.01354  0.20299  0.38204 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.795e+00  5.507e-01   8.707 1.11e-10 ***
ln.circ       6.301e-01  4.798e-02  13.131 6.77e-16 ***
percmale      5.698e-04  1.734e-03   0.329  0.74418    
medianincome  1.652e-05  4.927e-06   3.352  0.00179 ** 
domestic     -5.732e-02  8.556e-02  -0.670  0.50687    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2457 on 39 degrees of freedom
Multiple R-squared:  0.8295,    Adjusted R-squared:  0.812
F-statistic: 47.42 on 4 and 39 DF,  p-value: 1.801e-14
This model has a greater R-squared than all our previous models, and its p-value is smaller, so the model is useful. Besides ln.circ, the medianincome variable, which was not significant in the previous models, is statistically significant in this model. We recommend keeping three variables: ln.circ, medianincome, and the dummy variable domestic.
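The residual-diagnostics code for Model 3 is not shown in the excerpt; it presumably mirrored the earlier residual analyses, for example (assuming res_data3 is defined this way):

> res_data3 <- residuals(M3)
> predicted_data3 <- predict(M3)
> plot(predicted_data3, res_data3)   # residuals vs fitted values
> abline(h = 0, col = "red")
> qqnorm(res_data3)                  # QQ plot to check normality
> qqline(res_data3, col = "red")
> dwtest(M3)                         # Durbin-Watson test (lmtest already loaded)
> acf(res_data3)                     # autocorrelation of the residuals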
The residual plot does not show heteroskedasticity, so the assumption of equal error variances is reasonable. Except for a few extreme residuals, the QQ plot suggests approximate normality. The residuals show a random pattern, so this model meets the linearity assumption.

> for (i in 2:5) {
    plot(mag_data3[, i], res_data3,
         xlab = colnames(mag_data3)[i], ylab = "residuals")
  }
> # plot the residuals against each predictor variable
Plotting the residuals against the predictor variables, we do not notice any apparent pattern. If there is room for improvement, however, we may want to try a transformation of the percmale variable. The Durbin-Watson test gives a p-value of 0.5624, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows none.

> newdata2 <- newdata
> newdata2$circ <- log(newdata2$circ)
> colnames(newdata2)[1] <- "ln.circ"
> # transform newdata to feed the predict method
> predict(M3, newdata = newdata2, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 10.87309 10.58166 11.16453

$se.fit
[1] 0.144082

$df
[1] 39

$residual.scale
[1] 0.2457245
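The fitted values above are on the log scale. To express the interval in dollars, we can exponentiate (a rough back-transformation that ignores retransformation bias; the rounded figures in the comment are my own calculation, not part of the original output):

> exp(c(fit = 10.87309, lwr = 10.58166, upr = 11.16453))
> # roughly $52,700 (fit), $39,400 (lwr), $70,600 (upr)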
Outliers
I created a function that automatically removes outliers based on Cook's Distance. The function first fits a linear model with the response variable in the first column and the explanatory variables in the remaining columns, flags observations whose Cook's distance exceeds 4/n, removes them, refits the linear model, and extracts the summary. In practice, we should reassess the model each time we remove a single outlier; for the sake of time (and, partly, my laziness), I remove them all at once here. The function lets us assess a new model quickly without going through the same lines of code again and again. Let's try it with Model 3.
> removeoutlier <- function(sampledata) {
    # fit a linear model with the 1st column as the response and the remaining columns as predictors
    model <- lm(as.formula(paste(colnames(sampledata)[1], "~",
                                 paste(colnames(sampledata)[2:ncol(sampledata)], collapse = "+"),
                                 sep = "")),
                data = sampledata)
    # flag observations whose Cook's distance exceeds 4/n (n = 44)
    cook.distance <- cooks.distance(model)
    outlier.index <- cook.distance > (4/44)
    nonoutlier.index <- !outlier.index
    mag_data_outlier_remove <- sampledata[nonoutlier.index, ]
    # refit the model without the flagged observations and return the summary
    MX <- lm(as.formula(paste(colnames(mag_data_outlier_remove)[1], "~",
                              paste(colnames(mag_data_outlier_remove)[2:ncol(mag_data_outlier_remove)], collapse = "+"),
                              sep = "")),
             data = mag_data_outlier_remove)
    summary(MX)
  }
> removeoutlier(mag_data3)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45301 -0.11550 -0.00042  0.13560  0.44817 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.414e+00  5.215e-01  10.382 2.26e-12 ***
ln.circ       5.944e-01  4.430e-02  13.417 1.41e-15 ***
percmale     -4.438e-04  1.611e-03  -0.276   0.7845    
medianincome  1.102e-05  4.808e-06   2.293   0.0278 *  
domestic     -2.167e-02  8.226e-02  -0.263   0.7938    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2195 on 36 degrees of freedom
Multiple R-squared:  0.86,    Adjusted R-squared:  0.8445
F-statistic: 55.29 on 4 and 36 DF,  p-value: 7.021e-15
After removing the outliers, the adjusted R-squared improves to about 0.845 (multiple R-squared 0.86).
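To see exactly which observations the 4/n rule flags (not part of the original code), one could inspect the Cook's distances directly:

> cd <- cooks.distance(M3)
> which(cd > 4 / nrow(mag_data3))   # row indices of the flagged magazines
> plot(M3, which = 4)               # base-R Cook's distance plot for M3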
Further Transformation
In this step we try different transformations and assess the R-squared. Below are the major steps that lead to the model with the highest R-squared.
Model 4 - Adding interactions
> mag_data4 <- mag_data3
> mag_data4$dom.ln.circ <- mag_data4$domestic * mag_data4$ln.circ
> mag_data4$dom.percmale <- mag_data4$domestic * mag_data4$percmale
> mag_data4$dom.medianincome <- mag_data4$domestic * mag_data4$medianincome
> # add interaction terms between domestic and the other predictors
> M4 <- lm(ln.pagecost ~ ., data = mag_data4)
> summary(M4)

Call:
lm(formula = ln.pagecost ~ ., data = mag_data4)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47604 -0.11733  0.03252  0.12351  0.48747 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.355e+00  1.142e+00   2.938 0.005726 ** 
ln.circ           7.207e-01  1.055e-01   6.828 5.53e-08 ***
percmale          9.997e-04  2.418e-03   0.413 0.681693    
medianincome      2.996e-05  7.352e-06   4.075 0.000242 ***
domestic          2.057e+00  1.305e+00   1.576 0.123765    
dom.ln.circ      -1.181e-01  1.174e-01  -1.006 0.320916    
dom.percmale      8.733e-04  3.407e-03   0.256 0.799175    
dom.medianincome -2.333e-05  9.927e-06  -2.350 0.024374 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2363 on 36 degrees of freedom
Multiple R-squared:  0.8545,    Adjusted R-squared:  0.8262
F-statistic:  30.2 on 7 and 36 DF,  p-value: 3.083e-13

> removeoutlier(mag_data4)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46834 -0.11309  0.03458  0.13354  0.47953 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.910e+00  1.111e+00   1.719  0.09462 .  
ln.circ           8.486e-01  1.018e-01   8.333 9.97e-10 ***
percmale          3.021e-03  2.210e-03   1.367  0.18070    
medianincome      3.454e-05  6.565e-06   5.261 7.87e-06 ***
domestic          3.821e+00  1.246e+00   3.067  0.00422 ** 
dom.ln.circ      -2.745e-01  1.117e-01  -2.457  0.01927 *  
dom.percmale     -3.100e-03  3.138e-03  -0.988  0.33024    
dom.medianincome -2.702e-05  8.751e-06  -3.088  0.00400 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2046 on 34 degrees of freedom
Multiple R-squared:  0.8891,    Adjusted R-squared:  0.8663
F-statistic: 38.95 on 7 and 34 DF,  p-value: 2.005e-14
Model 5 - Squaring percmale
> mag_data5 <- mag_data4
> mag_data5$sqr.permale <- mag_data5$percmale * mag_data5$percmale
> # check what the r-squared would be after removing outliers
> removeoutlier(mag_data5)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40544 -0.06190  0.03097  0.08984  0.45945 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.753e+00  1.125e+00   3.334 0.002172 ** 
ln.circ           6.927e-01  9.910e-02   6.990 6.40e-08 ***
percmale         -2.206e-02  7.070e-03  -3.120 0.003816 ** 
medianincome      3.271e-05  6.408e-06   5.104 1.47e-05 ***
domestic          2.204e+00  1.206e+00   1.827 0.077037 .  
dom.ln.circ      -1.071e-01  1.079e-01  -0.993 0.328364    
dom.percmale      2.483e-03  3.095e-03   0.802 0.428188    
dom.medianincome -2.699e-05  8.142e-06  -3.314 0.002290 ** 
sqr.permale       2.457e-04  6.759e-05   3.634 0.000967 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1761 on 32 degrees of freedom
Multiple R-squared:  0.9192,    Adjusted R-squared:  0.899
F-statistic: 45.51 on 8 and 32 DF,  p-value: 2.521e-15
After adding interaction terms and squaring the percmale variable, we arrive at a new model with an improved R-squared and adjusted R-squared.
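The code defining mag_data6 and M6 is not shown in the excerpt. Judging from the variables checked in the prediction below (ln.circ, percmale, medianincome, domestic, dom.medianincome, sqr.permale) and the 37 residual degrees of freedom, it was presumably something like the following sketch (my reconstruction, not the original code):

> # drop the insignificant interaction terms, keeping dom.medianincome and sqr.permale
> mag_data6 <- mag_data5[, !(colnames(mag_data5) %in% c("dom.ln.circ", "dom.percmale"))]
> M6 <- lm(ln.pagecost ~ ., data = mag_data6)
> summary(M6)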
> newdata3 <- newdata2
> newdata3$dom.medianincome <- newdata3$domestic * newdata3$medianincome
> newdata3$sqr.permale <- newdata3$percmale * newdata3$percmale
> colnames(newdata3) == colnames(mag_data6)[-1]
[1] TRUE TRUE TRUE TRUE TRUE TRUE
> # transform the newdata to feed the predict method
> predict(M6, newdata = newdata3, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 11.18268 10.80804 11.55731

$se.fit
[1] 0.1848971

$df
[1] 37

$residual.scale
[1] 0.2185238
Summary
The last model fits the data best, since it explains over 90% of the variation in the one-page ad cost.
Model 3 is also a good candidate, with an R-squared of about 0.83 (0.86 after removing outliers).
Model 1 is also a decent model, with an R-squared over 80%, and it has the advantage of a simple form. The regression equation, pagecost = 7179.4408 + 3.9633 × circ − 32.4809 × percmale + 0.7001 × medianincome − 640.5486 × domestic, is very easy to explain: for example, by increasing the projected circulation by 1,000 (thousand), Dr. Sam can raise the ad cost by almost $4,000.
Models 3 and 6 are relatively complicated, with transformed variables and interaction terms, so it is harder to tell directly how the ad cost would change when the predictor variables change. In addition, Model 6 predicts a wider interval than Model 3, which should also be taken into account.